Purdue CMS Tier-2 Center

Use of GPU Resources

Currently, CMS users of the Purdue Tier-2 center have access to the following GPU resources:

Gilbreth: The GPU-enabled Community Cluster

  • Users with approved access can log into the Front-End machines and use the available CPU+GPU resources interactively for developing/prototyping/testing their code and algorithms. The User Guide provides detailed instructions for all tasks. 

    NOTE: The front-end machines, as shared resources, are not intended for running long, resource-intensive (GPU, CPU, RAM) jobs. Use them to develop and test your code/algorithms, then run your long jobs in the batch system.

  • In addition, users can submit SLURM batch jobs for their large-scale production work, using the standard submission techniques.

    CMS has a dedicated job queue (referred to by SLURM as an 'account') called 'cms'. It provides two dedicated batch slots for long-running jobs. These are shared among all CMS users, so wait times will vary depending on the load of the systems and the number of jobs in the queue, but in general a new job should start within a 12-hour window. Here's an example of submitting a job to the 'cms' queue (a minimal sketch of a job script like my_slurm_job.sh is shown after this list):
    sbatch -A cms --time=11:59:00 -N1 -n1 --gres=gpu:1 --cpus-per-task=8 --mem=64G my_slurm_job.sh

  • For debugging purposes it is usually much faster to submit the job to the 'debug' queue instead of 'cms'. Jobs in the 'debug' queue start much faster, but cannot take more than 30 minutes of walltime. Here's an example:
    sbatch -A debug --time=00:29:00 -N1 -n1 --gres=gpu:1 --mem=64G --constraint=V100 test_slurm_job.sh
    Note the additional --constraint=V100 parameter! We use it here because the 'debug' queue submits jobs to all Gilbreth nodes, and only some of them have V100 GPUs (the others have P100s). So, if you want your job to run on the same type of GPU that you will later use for your production jobs in the 'cms' queue (which only submits to V100 nodes), it is better to request it explicitly during testing. Of course, if your job is not tied to a specific GPU type, you can omit this constraint and let the job run on whatever slot becomes available first.


  • A very useful application of the 'debug' queue is running short interactive batch jobs. These give you access to a full node (unlike the shared Front-Ends), so you can use all of the node's RAM+CPU+GPU resources, which lets you test your code in a realistic 'production' environment. Here's an example of how to start such an interactive job:
    sinteractive -A debug --time=00:29:00 -N1 -n1 --gres=gpu:1 --mem=96G --constraint=V100

  • Finally, for low-priority jobs, the Gilbreth cluster offers the shared 'partner' queue, which, like the 'debug' queue, can submit to all nodes, but allows jobs of up to 24 hours of walltime. This queue only runs jobs when there are nodes available that are not serving the higher-priority 'owner' queues, so wait times can be much longer (see the example below).
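
    For example, assuming the SLURM account is also named 'partner' (as the queue is called above) and requesting just under the 24-hour limit, a submission could look like this:
    sbatch -A partner --time=23:59:00 -N1 -n1 --gres=gpu:1 --cpus-per-task=8 --mem=64G my_slurm_job.sh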
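
For reference, here is a minimal sketch of what a job script such as my_slurm_job.sh could contain. The environment setup and payload commands are purely illustrative (module names and scripts will differ for your workflow):
    #!/bin/bash
    # Minimal example job script (illustrative only). Resources are requested on
    # the sbatch command line in the examples above, but they could equally be
    # specified here with #SBATCH directives.
    #SBATCH --job-name=gpu-test

    echo "Running on $(hostname)"
    nvidia-smi                  # confirm which GPU(s) were allocated to the job

    # Set up your environment here (illustrative; actual module names may differ):
    # module load cuda

    # Run the actual workload, e.g.:
    # python my_training_script.py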

Hammer: The GPU-enabled CMS Cluster

  • Currently the front-end (login) nodes of the Hammer cluster are not instrumented with GPUs. This will likely change in the course of this year, but until then please use the group of compute nodes (hammer-f), which have NVIDIA T4 GPUs (see the examples below).


  • To start a short (30 min) interactive SLURM job on a GPU-instrumented Hammer node for testing purposes, try this (a quick check of the GPU allocation is sketched after this list):
    sinteractive -p hammer-f -A debug -N1 -n1 --mem=16G --time=00:29:00 --gres=gpu:1 --constraint=T4

  • To start a normal (12h) interactive SLURM job on a GPU-instrumented Hammer node, try this:
    sinteractive -p hammer-f -A cms -N1 -n1 --mem=16G --time=11:59:00 --gres=gpu:1 --constraint=T4

  • To submit a batch job to a GPU node: 
    sbatch -A cms --partition=hammer-f --time=11:59:00 -N1 -n1 --gres=gpu:1 --cpus-per-task=8 --mem=64G myjob.sh
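
Once an interactive session has started (on Hammer or on Gilbreth), a quick sanity check that the requested GPU was actually allocated is to inspect it from within the session. nvidia-smi and the CUDA_VISIBLE_DEVICES variable, which SLURM typically sets for jobs requesting --gres=gpu, are convenient for this:
    nvidia-smi                    # should list the allocated GPU (e.g. a T4 on hammer-f nodes)
    echo $CUDA_VISIBLE_DEVICES    # GPU indices assigned to this job by SLURM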

CMS-Connect


  • A convenient way of submitting HTCondor jobs to GPU-enabled resources is through the CMS-Connect service, as documented in a recent USCMS presentation. A generic submit-file sketch is shown below.
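
    For reference, a generic HTCondor submit file requesting a single GPU could look like the following sketch; the executable name is a placeholder, and any CMS-Connect-specific attributes (site routing, project/accounting settings) should be taken from the linked documentation:
    # generic GPU submit file sketch (placeholder values)
    universe       = vanilla
    executable     = run_gpu_task.sh
    request_gpus   = 1
    request_cpus   = 1
    request_memory = 4 GB
    output         = job_$(Cluster)_$(Process).out
    error          = job_$(Cluster)_$(Process).err
    log            = job_$(Cluster).log
    queue 1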
